Methods and systems of scheduling computer processes or tasks in a distributed system

ABSTRACT

A cloud computer system is provided that includes a plurality of computer devices and a database. The plurality of computer devices execute a plurality of virtual machines, with one of the virtual machines serving as a controller node and the remainder serving as worker instances. The controller node is programmed to accept a request to initiate a distributed process that includes a plurality of data jobs, determine a number of worker instances to create across the plurality of computer devices, and cause the number of worker instances to be created on the plurality of computer devices. The worker instances are programmed to create a unique message queue for the corresponding worker instance, and store a reference for the unique message queue that was created for the corresponding worker to the database. The controller node retrieves the reference to the unique message queues and posts jobs to the message queues for execution by the worker instances.

CROSS REFERENCE(S) TO RELATED APPLICATION(S)

This application claims priority to U.S. Provisional Application No. 62/459,722, filed Feb. 16, 2017, the entire contents of which are hereby incorporated by reference. This application also incorporates by reference U.S. Provisional Application No. 62/459,711, filed Feb. 16, 2017 and an application titled “SYSTEMS AND METHODS OF RETROSPECTIVELY DETERMINING HOW SUBMITTED DATA TRANSACTION REQUESTS OPERATE AGAINST A DYNAMIC DATA STRUCTURE” (attorney docket number: 4010-424) filed on the same date of the instant application.

TECHNICAL OVERVIEW

The technology described relates to scheduling computer processes or tasks for computer processes. More particularly, the technology described relates to scheduling computer processes, tasks, or jobs in a distributed environment, such as a cloud-based computer system.

INTRODUCTION

Cloud computing technology provides for shared processing and data resources (collectively computing resources). This technology allows for provisioning of computing resources on an on-demand basis where client computers can use one to thousands of hardware processors. Individuals and organizations find the flexibility of this technology attractive for handling data processing that can use a large amount of computing resources.

While cloud computing systems may be used to provide an arbitrary number of processing instances (e.g., virtual or physical machines), the actual provisioning of tasks to such resources is typically a static or manual operation. For example, if 50 different virtual machines are created for analyzing weather data, a static configuration will need to be developed that details how those 50 different virtual machines are to be used for the weather analysis process. Static configurations may work in certain implementations when the incoming data and number of virtual machines are relatively constant (e.g., the same amount, same type, etc.), but may break down when the amount or type of incoming data is (highly) variable. In other words, while one may be able to create an arbitrary number of instances in a cloud-computing environment, to effectively use those machines, the overall process must know how to communicate with those virtual machines to instruct them as to what job to perform.

Thus, new techniques for managing or scheduling tasks or jobs in a distributed, dynamic environment, such as a cloud computing environment (e.g., where there may be an arbitrary number of servers available), are needed. Techniques for providing data or information to an arbitrary number of servers and/or the job processes of those servers are also needed.

SUMMARY

In certain example embodiments, a cloud computer system (system) is provided. The system includes a plurality of computer devices coupled via an electronic data communications network, with each of the plurality of computer devices having at least one hardware processor and electronic data storage. Each device is configured to host at least one virtual machine instance with at least one of the virtual machine instances configured as a controller instance (e.g., a controller node). The system includes a database accessible by each of the virtual machine instances. The controller instance is programmed to accept a request to initiate a distributed process that includes a plurality of data jobs and determine a number of worker instances to create across the plurality of computer devices. Once determined, the controller instance causes a number of worker instances to be created on the plurality of computer devices (e.g., on the cloud computer system). Each of the created worker instances, as part of an initialization process, creates its own unique message queue and communicates with the database to store a reference to the message queue in the database. The controller node is further programmed to read the references to the message queues from the database and publish the data jobs to the message queues.

This Summary is provided to introduce a selection of concepts that are further described below in the Detailed Description. This Summary is intended neither to identify key features or essential features of the claimed subject matter, nor to be used to limit the scope of the claimed subject matter; rather, this Summary is intended to provide an overview of the subject matter described in this document. Accordingly, it will be appreciated that the above-described features are merely examples, and that other features, aspects, and advantages of the subject matter described herein will become apparent from the following Detailed Description, Figures, and Claims.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features and advantages will be better and more completely understood by referring to the following detailed description of example non-limiting illustrative embodiments in conjunction with the drawings of which:

FIG. 1 shows an example system creating and scheduling multiple different processing instances and corresponding jobs for those instances;

FIG. 2 shows a signal diagram of how the system in FIG. 1 is initialized according to certain example embodiments;

FIG. 3 shows how jobs are submitted to worker instances during a report generation phase according to certain example embodiments; and

FIG. 4 shows an example computing device that may be used in some embodiments to implement features described herein.

DETAILED DESCRIPTION

In the following description, for purposes of explanation and non-limitation, specific details are set forth, such as particular nodes, functional entities, techniques, protocols, etc. in order to provide an understanding of the described technology. It will be apparent to one skilled in the art that other embodiments may be practiced apart from the specific details described below. In other instances, detailed descriptions of well-known methods, devices, techniques, etc. are omitted so as not to obscure the description with unnecessary detail.

Sections are used in this Detailed Description solely in order to orient the reader as to the general subject matter of each section; as will be seen below, the description of many features spans multiple sections, and headings should not be read as affecting the meaning of the description included in any section.

Overview

Certain example embodiments described herein relate to cloud computing architecture and systems. In certain examples, a cloud computer system includes hundreds or thousands of different physical computers (e.g., an example computer system is shown in FIG. 4). When clients and users use the processing resources of a cloud computer system, they create worker instances (also referred to as processing instances or instances) that can then execute tasks (e.g., jobs that handle data, or data jobs) as directed by the users/clients. In certain examples, the worker instances are a combination of virtual machines implemented by the underlying processing resources of the cloud computer system and the set of tasks or jobs that those virtual machines are programmed to execute. The virtual machines may be system virtual machines (e.g., that replicate the functionality of a “real” computer system and are thus fully virtualized) or process based virtual machines (e.g., that may execute computer programs in a platform independent manner). The worker instances may be created and destroyed as new instances are needed and the jobs that the instances are tasked with are completed.

In certain example embodiments, a cloud computer system that can create and destroy an arbitrary number of worker instances is provided. Each of the worker instances is assigned one or more jobs that receive input data, process the input data, and generate output (e.g., in the form of output data or a report of the output data). In certain example embodiments, a controller node (e.g., an instance of the cloud computer system tasked with controller functions) is tasked with creating, destroying, and/or controlling worker instances that will perform jobs based on input data. The controller node is programmed to request the creation of the arbitrary number of worker instances. Each of the worker instances then creates a corresponding input queue (e.g., where the respective worker instance will read input data from). Each worker instance also writes the location of the correspondingly created queue (e.g., a reference to that queue, such as a URL) and/or a pointer or other reference to the worker instance to a database. The creation of the worker instance and its subsequent communication with the database may be collectively referred to as the initiation phase of the worker instance. During the initiation phase, the controller periodically queries the database until all of the requested worker instances have populated the details of their respective instance and queue reference. Once the controller node has retrieved the references to the respective queues (e.g., the URLs) of the worker instances, it may then task the worker instances with jobs by publishing jobs to the queues associated with the worker instances.

In many places in this document, including but not limited to in the below descriptions of FIGS. 1-4, software modules and actions performed by software modules are described. This is done for ease of description; it should be understood that, whenever it is described in this document that a software module performs any action, the action is in actuality performed by underlying hardware elements (such as a processor and a memory device) according to the instructions that comprise the software module. Further details regarding this are provided below in, among other places, the description of FIG. 4.

Description of FIG. 1:

FIG. 1 shows an example cloud computer system for creating and scheduling multiple different processing instances (sometimes also referred to as “worker instances” or just “instances”) and corresponding jobs for those instances. Cloud computer system 100 includes two main components. The first is the controller node (sometimes referred to as a controller instance herein) 102 and the second is worker instances 116A, 116B, and 116C that are dynamically created in response to a request by the controller node 102.

Controller node 102 is a virtual machine (or virtual container, or other virtual environment) instance that operates on one of a plurality of hardware computer nodes 103n that make up hardware computing resources of the cloud computer system 100. In certain examples, each of the hardware computer nodes 103n is a physical server or computer (e.g., as shown in FIG. 4), and the nodes are coupled together via an electronic data communications network (e.g., gigabit Ethernet or other types of data communications technology).

The controller node 102 includes a controller process 108 that includes processing logic (e.g., a computer program) for requesting the creation of new worker instances, handling incoming data, distributing jobs, controlling the progression of the overall process and the progression of each job being performed by the cloud computing system 100 and the instances thereof, and the like.

Each worker instance 116A-116C is a combination of a virtual machine (e.g., an instance or container) and a process or thread (the worker) that is executing within that virtual machine. Thus, the instance and the executing worker are referred to as a worker instance that is able to process jobs and data posted to the corresponding queues 114A-114C. The worker instances are stateless immutable instances and wait until work (e.g., a job) is assigned to them using their respective queues 114A-114C. In certain example embodiments, the work is communicated to the queues in the form of JavaScript Object Notation (JSON) objects. JSON is a data-interchange format that can be human readable, but is also easily parsed by a computer. Other types of data formats may also be used.

An example cloud computer system may be the Amazon Web Services (AWS) cloud computer system that provides different types of instances depending on the processing requirements. In certain examples, each instance is supported by underlying processing resources (e.g., such as the computer system shown in FIG. 4). In certain examples, multiple different instances share the same underlying processing resources (e.g., are handled by the same computer device). In certain examples, the underlying hardware can be dedicated to a single instance. It will be appreciated that the flexibility of a cloud computer system architecture provides for a variety of different schemes for handling instances that are used to complete tasks or jobs.

In certain example embodiments, all of the instances (e.g., the worker instances and the controller instances) share a common system image (e.g., one that is used across all of the different instances, including the controller node, created on the cloud computer system). Of the instances, the controller node is a long running instance and the worker instances are created and terminated by the controller process 108 on the controller node. In certain example embodiments, the controller instance autoscales to a group size of 1. This ensures the cloud computing system 100 will always have one instance of the controller instance running. In certain examples, when a newly created instance is started, it is passed (e.g., as part of its startup) the user data, which indicates what role (controller or worker) that instance is to have. If the instance is marked as a controller node, the controller process may be started. However, if the instance is a worker instance, then a service for a worker process is started. This type of deployment architecture may make deployment easier, as only one image for the instances needs to be created.
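The role selection described above can be sketched as follows. This is only an illustrative sketch: the `role` key in the user data and the service names returned by `select_service` are assumptions made for the example and are not specified in this document.

```python
def select_service(user_data: dict) -> str:
    """Pick which process to start on a freshly booted instance.

    The 'role' key and the returned service names are hypothetical.
    """
    role = user_data.get("role")
    if role == "controller":
        return "controller_process"   # start the controller process
    if role == "worker":
        return "worker_service"       # start the service for a worker process
    raise ValueError(f"unknown or missing role in user data: {role!r}")
```

Because both roles boot from the same image, the only per-instance difference in this sketch is the user data passed at startup.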

The controller node 102 includes or stores a cron job 104, which is a script or process configured to run at predefined time periods (e.g., every night at 11 PM). In this example, the cron job 104 is used to signal the initiation of the process that takes input for a given date and produces reports for the data for that date. The cron job communicates to the controller process 108 via queue 106. In certain examples, the queue is implemented using a simple queue service (SQS) that is part of AWS.

Also included in the system 100 is an SQRL repository that provides for authentication and authorization services for users or clients that interact with the system 100. SQRL is an open standard for secure quick reliable login functionality for websites and the like. FTP site 112 is a storage area for storing reports and/or other output generated by using the system 100. Users can then access the FTP site 112 and retrieve the data (e.g., a generated report).

Report Generator 118 may be another instance (or part of the controller instance) that generates a finalized report from the data processed and output from the worker instances. For example, each worker instance may return or output a data list, and the report generator 118 may generate a report (e.g., in PDF form or a web page) from the outputted lists with charts and the like to visually show the result of the processing. The results of the report generator 118 and/or the output from the worker instances 116A-116C may be stored in intermediate storage 120. Database 122 and/or intermediate storage 120 may be used for storing data (e.g., in bulk form) processed by instances 116. In certain example implementations, database 122 is a DynamoDB database that is available as part of the cloud computer system offered by Amazon Web Services (AWS) and the intermediate storage 120 is S3 storage of AWS.

The database 122 may include multiple different database tables. A controller table 124A records or keeps track of the status of the controller and the overall status of a process (e.g., in the case that a report is being generated, it may keep track of the overall state of the report). Table 124A may include the following columns shown in Table 1 below:

TABLE 1

Field                Description
Id                   The instance identifier for the controller (currently only supports one).
Date                 The reporting date for the currently executing report.
Init_Completed       Has the initialization phase been completed (e.g., all instances started and initial jobs published).
Sqrl_completed       Has the sqrl file been published to S3.
Reported_completed   Has the raw data version of the reports completed.
Reports_Published    Have the customer facing reports been published to the external FTP (e.g., 112).
Shutdown_Completed   Have the worker instances been shut down and the instance table been cleaned up.

Another table 124B (the job status table) may be used to record what phase a certain process is in. In certain examples, for parallel overall processes, it may keep track of how many jobs there are and how many have been completed. In certain examples, there is one row per date and overall process. Table 124B may include the following columns shown in Table 2 below:

TABLE 2

Field      Description
Date       Date for which the report phase executed.
Report     The name of the report being generated.
Phase      The phase of the report currently being executed.
Total      Total number of jobs that make up this phase.
Started    How many of the jobs have started. If this number is greater than the finished amount it means some jobs are failing and are being retried.
Finished   How many jobs have been completed.

Another table 124C (the instance job table) may be used to record what job an instance (e.g., a worker instance) is executing, when it started and when it finished. Table 124C may include the following columns shown in Table 3 below:

TABLE 3

Field       Description
Instance    An instance id (e.g., an AWS instance id).
Queue_URL   URL of the queue for this worker instance.
Job         Job message associated with the most current execution.
Started     Timestamp when the job started executing.
Finished    Timestamp when the job finished executing.

An example job message may include the following values: 1) “command”: “reportctl” (the command parameter specifies the type of work to do); 2) “report”: “mola” (the report parameter specifies the specific type of work to do); 3) “phase”: “report” (the phase parameter specifies which phase the report is currently in; each report can have more than one phase); 4) “partition”: 1 (the partition parameter specifies how the data should be sliced); and 5) “date”: “07202016” (the date parameter specifies what day the data should be processed for).
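For illustration, the example job message above could be serialized as a JSON object before being posted to a work queue. The following sketch uses only the field names and values given above; reading the date value as month, day, and four-digit year is an assumption, and the variable names are hypothetical.

```python
import json
from datetime import datetime

# The example job message from the description, as a Python mapping.
job_message = {
    "command": "reportctl",   # type of work to do
    "report": "mola",         # specific type of work to do
    "phase": "report",        # current phase of the report
    "partition": 1,           # how the data should be sliced
    "date": "07202016",       # day the data should be processed for
}

# Serialize for posting to a queue, then parse as a worker instance would.
encoded = json.dumps(job_message)
decoded = json.loads(encoded)

# Assumption: the date is in month-day-year (mmddyyyy) form.
report_date = datetime.strptime(decoded["date"], "%m%d%Y").date()
```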

Another table 124D may keep track of heartbeats from the worker instances 116A-116C. The following columns may be included as shown in Table 4 below:

TABLE 4

Field      Description
Instance   An instance id (e.g., an AWS instance id).
Last_hb    YYYY-mm-dd HH:MM:SS of when the instance last reported a timestamp.

Description of FIG. 2:

FIG. 2 shows a signal diagram of how the system in FIG. 1 is initialized according to certain example embodiments.

At 202, the init phase of the controller process 108 is triggered by cron job 104, which posts a message to the SQS that is associated with the controller process 108. An example message posted to the SQS of the controller process may include the following fields and data: 1) “type”: “report”; and 2) “date”: “07202016.” The posting of the message via a cron job automates the starting of the process, along with automatically triggering the subsequent generation of the worker instances and jobs that are pushed to those instances. The message generated by the cron job and pushed to the queue may contain the date of the data that the process (e.g., a report process) will be analyzing. In response to this message, the controller process 108 initializes the system in preparation for running the jobs. As noted herein, the jobs may all be part of a larger task or process (e.g., a process that is to be distributed, in the form of the jobs, across the nodes of system 100) that may be, for example, to generate or run a report on an existing dataset.

One type of job may include the retrospective analysis process shown and described in co-pending application entitled “SYSTEMS AND METHODS OF RETROSPECTIVELY DETERMINING HOW SUBMITTED DATA TRANSACTION REQUESTS OPERATE AGAINST A DYNAMIC DATA STRUCTURE,” attorney docket number 4010-424, the entire contents of which are hereby incorporated by reference. With such a job, a client can request a report for missed opportunities, and each individual job that is tasked to a different worker instance may be associated with a different ticker symbol (e.g., where the data operated on by a worker instance includes a data structure to be analyzed for an order book for a given ticker). One worker instance may run the retrospective process for ticker symbol AAA and another for BBB. The output from these multiple different worker instances may be combined into a report that is shown to the client (as shown in FIG. 5 of the above-noted co-pending application).

At 204, the controller process 108 begins starting worker instances 116. In certain examples this is done by invoking the appropriate API of the cloud computing system 100 to generate a requested number of worker instances. In certain examples, the controller process 108 may also dynamically determine the number of worker instances that are to be spawned based on the amount of processing that the jobs to be accomplished are expected to take. In other words, the number of worker instances that are needed for a given execution of the controller may be highly variable from one iteration to another (e.g., from one day to another). Thus, the controller process 108 may request the creation of 10 worker instances (or fewer) or the creation of 1000 worker instances (or more) depending on the amount of work that a given job or jobs are expected to take. In certain example embodiments, the number of instances may be controlled via a configuration (e.g., a configuration file).
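One way such a dynamic determination could be made is sketched below. The heuristic (total expected processing time divided by a per-worker time budget, clamped to configured bounds) is an assumption chosen for illustration, as are the function name, parameter names, and default bounds; this document does not prescribe a particular formula.

```python
import math

def worker_count(expected_job_seconds, seconds_per_worker,
                 min_workers=1, max_workers=1000):
    """Estimate how many worker instances to request.

    expected_job_seconds: estimated processing time of each job, in seconds.
    seconds_per_worker:   hypothetical time budget a single worker may spend.
    min_workers/max_workers: hypothetical bounds, e.g. from a config file.
    """
    total = sum(expected_job_seconds)
    needed = math.ceil(total / seconds_per_worker)
    # Clamp the estimate to the configured bounds.
    return max(min_workers, min(max_workers, needed))
```

For example, 100 jobs of about one minute each with a ten minute budget per worker yields a request for 10 worker instances, while a much heavier day is capped at the configured maximum.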

Once the creation of the worker instances has been initiated (or the request to create those instances has been acknowledged), the controller waits at 206 until the worker instances have been created. This may include having the controller process 108 query database 122 to determine if and when the worker instances 116 have successfully registered themselves as described in connection with 208 and 210.

At 208, each newly created worker instance 116 creates its own corresponding work queue (e.g., an SQS) that may be unique for that worker instance. The work queue 114 is generally how data (e.g., a job) is communicated to each respective worker instance. In certain examples, the work queues are identified within the cloud computer environment by unique URLs. Once the work queue 114 is created and known, the worker instance 116 writes or otherwise communicates with database 122 to write both the name of the correspondingly created work queue (e.g., a reference to that queue) and a reference to the corresponding worker instance. This information is communicated to the instance job table 124C at 210, where that information is stored for future retrieval.

In certain examples, the processing for 208 and 210 occurs during an initialization phase for the worker instance, for example in an “init( )” function that is called when the worker instance is first started. In any event, during this period the controller process 108 continues to wait at 206. The waiting may include successive queries to the instance job table to determine if the worker instances have reported their respective information. Once such information is written to the instance job table, at 212, the controller process queries the database 122 to get references to the work queues (e.g., the URLs) and/or associated worker instances. With this list the controller process 108 begins posting/submitting jobs to the work queues 114 at 214. In certain example embodiments, the process of posting jobs to the work queues is performed in a round robin manner.
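The wait at 206, the lookup at 212, and the round robin posting at 214 can be sketched as follows. This is a minimal sketch under stated assumptions: `read_rows` and `send` are hypothetical callables standing in for the database query and the queue service, the polling interval and timeout are illustrative, and rows are assumed to carry the `Queue_URL` field of Table 3.

```python
import itertools
import time

def wait_for_registration(read_rows, expected, poll_seconds=1.0,
                          timeout_seconds=300.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Poll the instance job table until `expected` workers have registered.

    read_rows is a hypothetical callable returning the current table rows.
    Returns the list of registered queue URLs.
    """
    deadline = clock() + timeout_seconds
    while True:
        rows = read_rows()
        urls = [row["Queue_URL"] for row in rows if row.get("Queue_URL")]
        if len(urls) >= expected:
            return urls
        if clock() >= deadline:
            raise TimeoutError("worker instances did not register in time")
        sleep(poll_seconds)

def post_round_robin(jobs, queue_urls, send):
    """Distribute jobs over the queues in round robin order.

    send is a hypothetical callable (queue_url, job) that posts one job.
    """
    if not queue_urls:
        raise ValueError("no work queues available")
    targets = itertools.cycle(queue_urls)
    for job in jobs:
        send(next(targets), job)
```

Injecting `read_rows`, `send`, `sleep`, and `clock` keeps the sketch independent of any particular database or queue client.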

Description of FIG. 3:

FIG. 3 shows how jobs are submitted to worker instances during a report generation phase according to certain example embodiments.

At 300, the controller process 108 updates the job status table 124B of the database 122 with the total number of jobs for the phase and the phase name.

At 302, the controller process submits jobs as needed by distributing them over the available job queues 114 (e.g., by using a round robin algorithm).

At 304, the worker instances read their respective queues, and at 306 the job that was submitted to the queue is executed by the worker instance. In accordance with starting the job at 306, the corresponding worker instance updates, at 308, the instance information in the instance job table to indicate what job the worker is running and when that job was started by the worker instance.

The processing for the received job is carried out at 312. Once the worker instance has completed the job, it updates the finished column in the job status table for the corresponding job at 316 and updates or sets the finished time in the instance job table.
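The bookkeeping performed by a worker instance around a job can be sketched with plain dictionaries standing in for rows of the job status table (Table 2) and the instance job table (Table 3). The function names are hypothetical. Incrementing the Started counter from the worker and the timestamp format are assumptions; this document states only that the worker records the job and its start time, and later the finished count and finished time.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def record_start(status_row, instance_row, job, now):
    """Record that a worker instance has started executing `job`."""
    status_row["Started"] += 1                  # assumption: worker bumps Started
    instance_row["Job"] = job
    instance_row["Started"] = now.strftime(TS_FORMAT)
    instance_row["Finished"] = None

def record_finish(status_row, instance_row, now):
    """Record that the worker instance has completed its current job."""
    status_row["Finished"] += 1
    instance_row["Finished"] = now.strftime(TS_FORMAT)

def phase_complete(status_row):
    """A phase is complete once every job in it has finished."""
    return status_row["Finished"] >= status_row["Total"]
```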

In certain example embodiments, when a worker instance fails (e.g., there is no heartbeat within an amount of time such as 30 seconds, 1 minute, 10 minutes, 1 hour, or a day or more), the controller runs a script that migrates all jobs from that worker instance to other job queues (or starts the process of creating additional worker instances). This includes unfinished jobs that the worker instance was executing as well as any pending jobs in the work queue for that instance. In other words, if a worker instance does not report its heartbeat and the last report is outside of a given threshold amount, the controller may migrate jobs previously assigned to that worker instance to another worker instance.
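The failure detection and migration described above can be sketched as follows. In-memory dictionaries stand in for the heartbeat table and the work queues, and the function names, the data layout, and the round robin choice of target queues are assumptions made for the example.

```python
from datetime import datetime, timedelta
from itertools import cycle

HB_FORMAT = "%Y-%m-%d %H:%M:%S"  # format of the Last_hb field

def failed_instances(heartbeats, now, threshold):
    """Return ids of instances whose last heartbeat is older than threshold.

    heartbeats maps instance id -> Last_hb string.
    """
    return sorted(
        instance for instance, last_hb in heartbeats.items()
        if now - datetime.strptime(last_hb, HB_FORMAT) > threshold
    )

def migrate_jobs(failed, pending, in_flight):
    """Move jobs of failed instances onto the queues of healthy ones.

    pending maps instance id -> list of queued jobs;
    in_flight maps instance id -> the job it was executing, if any.
    """
    survivors = [instance for instance in pending if instance not in failed]
    if not survivors:
        raise RuntimeError("no healthy worker instances; create new ones first")
    targets = cycle(survivors)
    for instance in failed:
        orphaned = []
        if in_flight.get(instance) is not None:
            orphaned.append(in_flight.pop(instance))  # unfinished job first
        orphaned.extend(pending.pop(instance, []))    # then its pending jobs
        for job in orphaned:
            pending[next(targets)].append(job)
```

When no healthy worker remains, the sketch raises an error, which corresponds to the alternative noted above of creating additional worker instances.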

In certain example embodiments, the controller process may include a script (or the functionality therein) called report_controller.py. This script controls start and stop of the cluster (e.g., all of the instances, including the controller instance) and report jobs. When not invoked with the “-init” flag, the following is done: 1) verify heartbeats from the worker instances; 2) check if any report has completed its current phase and, if so, post a job for the next phase; and 3) if all reports have finished their work, execute the publish job and shut down the worker instances.
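The three steps performed when the script is not invoked with the “-init” flag can be sketched as a single pass over the job status rows. This is not the report_controller.py script itself: the phase order in `PHASES`, the callables passed in, and the simplification of one job per phase are all assumptions made for illustration.

```python
# Hypothetical phase order; this document does not specify one.
PHASES = ["report", "convert", "post"]

def controller_pass(status_rows, verify_heartbeats, post_job, publish, shutdown):
    """One non-init pass of the controller over rows shaped like Table 2."""
    verify_heartbeats()                      # step 1
    all_done = True
    for row in status_rows:
        if row["Finished"] < row["Total"]:
            all_done = False                 # current phase still running
            continue
        position = PHASES.index(row["Phase"])
        if position + 1 < len(PHASES):
            # Step 2: phase finished, so advance and post a job for the next one.
            # Simplification: each new phase consists of a single job.
            row.update(Phase=PHASES[position + 1], Total=1, Started=0, Finished=0)
            post_job({"command": "reportctl", "report": row["Report"],
                      "phase": row["Phase"], "date": row["Date"]})
            all_done = False
    if all_done:
        publish()                            # step 3: publish job
        shutdown()                           # then shut down worker instances
    return all_done
```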

In certain example embodiments, the following script may execute on each worker instance: worker_agent.py. This script has two functions: 1) run a thread that updates the heartbeat table 124D to indicate that the worker instance is still alive; and 2) read messages from the SQS and execute them using the reportctl.py script discussed below.
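The two functions of such a worker agent can be sketched with the Python standard library alone. This is not the worker_agent.py script itself: a `queue.Queue` stands in for the SQS work queue, a dictionary stands in for the heartbeat table, `execute` is a hypothetical callable standing in for reportctl.py, and the `None` sentinel used to stop the loop is an assumption made so the example terminates.

```python
import json
import queue
import threading
from datetime import datetime

HB_FORMAT = "%Y-%m-%d %H:%M:%S"  # format of the Last_hb field

def heartbeat_loop(table, instance_id, stop, interval=0.05):
    """Write a heartbeat timestamp until `stop` is set (at least once)."""
    while True:
        table[instance_id] = datetime.now().strftime(HB_FORMAT)
        if stop.wait(interval):   # returns True once stop has been set
            break

def run_worker(instance_id, work_queue, heartbeat_table, execute):
    """Run the heartbeat thread and process JSON job messages from the queue."""
    stop = threading.Event()
    beat = threading.Thread(target=heartbeat_loop,
                            args=(heartbeat_table, instance_id, stop),
                            daemon=True)
    beat.start()
    try:
        while True:
            message = work_queue.get()   # blocks until work is assigned
            if message is None:          # hypothetical shutdown sentinel
                break
            execute(json.loads(message))
    finally:
        stop.set()
        beat.join()
```

Running the heartbeat on its own thread lets the instance keep reporting that it is alive while a long job is executing on the main thread.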

In certain example embodiments, a reportctl.py script may execute on each worker instance and execute a specific phase and partition of a report job. The script may take the following arguments: 1) report (indicates a given report that is to be run for the data); 2) phase (the phase of the report cycle that is to be executed); 3) date (date to run the report for, either a date in ‘mmddyyyy’ format or ‘today’ for the current business day; mutually exclusive with the sqrlfile argument; the script may fetch a sqrlfile if no date is present); 4) source (the source path of the report files to process); 5) target (directory that files produced by the report phase are written to); 6) sqrlfile (can specify a sqrlfile instead of a date); 7) partition (run a report for a specific partition; the partition is a number that represents a predefined symbol range, with −1 being the default to indicate no partitioning); 8) symbol (run a report for only one symbol, such as a ticker symbol); and 9) publishType (choose from which phase to publish data: report, convert, post, clientfile).
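The argument list above maps naturally onto a command-line parser. The following sketch uses Python's argparse module; the double-dash flag spelling, which arguments are required, and the function name `build_parser` are assumptions, since this document lists only the argument names and their meanings.

```python
import argparse

def build_parser():
    """Build a parser for the arguments listed above (flag syntax assumed)."""
    parser = argparse.ArgumentParser(prog="reportctl.py")
    parser.add_argument("--report", required=True,
                        help="report that is to be run for the data")
    parser.add_argument("--phase", required=True,
                        help="phase of the report cycle to execute")
    # date and sqrlfile are mutually exclusive.
    when = parser.add_mutually_exclusive_group()
    when.add_argument("--date", help="'mmddyyyy' or 'today'")
    when.add_argument("--sqrlfile", help="sqrlfile to use instead of a date")
    parser.add_argument("--source", help="source path of the report files")
    parser.add_argument("--target", help="directory for produced files")
    parser.add_argument("--partition", type=int, default=-1,
                        help="predefined symbol range; -1 means no partitioning")
    parser.add_argument("--symbol", help="run for only one symbol")
    parser.add_argument("--publishType",
                        choices=["report", "convert", "post", "clientfile"],
                        help="phase from which to publish data")
    return parser
```

A worker could then translate a job message into an argument list, for example `build_parser().parse_args(["--report", "mola", "--phase", "report", "--date", "07202016", "--partition", "1"])`.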

Description of FIG. 4

FIG. 4 is a block diagram of an example computing device 400 (which mayalso be referred to, for example, as a “computing device,” “computersystem,” or “computing system”) according to some embodiments. In someembodiments, the computing device 400 includes one or more of thefollowing: one or more processors 402; one or more memory devices 404;one or more network interface devices 406; one or more displayinterfaces 408; and one or more user input adapters 410. Additionally,in some embodiments, the computing device 400 is connected to orincludes a display device 412. As will explained below, these elements(e.g., the processors 402, memory devices 404, network interface devices406, display interfaces 408, user input adapters 410, display device412) are hardware devices (for example, electronic circuits orcombinations of circuits) that are configured to perform variousdifferent functions for the computing device 400.

In some embodiments, each or any of the processors 402 is or includes,for example, a single- or multi-core processor, a microprocessor (e.g.,which may be referred to as a central processing unit or CPU), a digitalsignal processor (DSP), a microprocessor in association with a DSP core,an Application Specific Integrated Circuit (ASIC), a Field ProgrammableGate Array (FPGA) circuit, or a system-on-a-chip (SOC) (e.g., anintegrated circuit that includes a CPU and other hardware componentssuch as memory, networking interfaces, and the like). And/or, in someembodiments, each or any of the processors 402 uses an instruction setarchitecture such as x86 or Advanced RISC Machine (ARM). As explainedherein, multiple computer systems may collectively form a cloud computersystem and each one of the computer systems is configured to host one ormore virtual machines (which are also referred as instances herein)

In some embodiments, each or any of the memory devices 404 is orincludes a random access memory (RAM) (such as a Dynamic RAM (DRAM) orStatic RAM (SRAM)), a flash memory (based on, e.g., NAND or NORtechnology), a hard disk, a magneto-optical medium, an optical medium,cache memory, a register (e.g., that holds instructions), or other typeof device that performs the volatile or non-volatile storage of dataand/or instructions (e.g., software that is executed on or by processors402). Memory devices 404 are examples of electronic data storage and/ornon-transitory computer-readable storage media.

In some embodiments, each or any of the network interface devices 406includes one or more circuits (such as a baseband processor and/or awired or wireless transceiver), and implements layer one, layer two,and/or higher layers for one or more wired communications technologies(such as Ethernet (IEEE 802.3)) and/or wireless communicationstechnologies (such as Bluetooth, WiFi (IEEE 802.11), GSM, CDMA2000,UMTS, LTE, LTE-Advanced (LTE-A), and/or other short-range, mid-range,and/or long-range wireless communications technologies). Transceiversmay comprise circuitry for a transmitter and a receiver. The transmitterand receiver may share a common housing and may share some or all of thecircuitry in the housing to perform transmission and reception. In someembodiments, the transmitter and receiver of a transceiver may not shareany common circuitry and/or may be in the same or separate housings.

In some embodiments, each or any of the display interfaces 408 is orincludes one or more circuits that receive data from the processors 402,generate (e.g., via a discrete GPU, an integrated GPU, a CPU executinggraphical processing, or the like) corresponding image data based on thereceived data, and/or output (e.g., a High-Definition MultimediaInterface (HDMI), a DisplayPort Interface, a Video Graphics Array (VGA)interface, a Digital Video Interface (DVI), or the like), the generatedimage data to the display device 412, which displays the image data.Alternatively or additionally, in some embodiments, each or any of thedisplay interfaces 408 is or includes, for example, a video card, videoadapter, or graphics processing unit (GPU).

In some embodiments, each or any of the user input adapters 410 is or includes one or more circuits that receive and process user input data from one or more user input devices (not shown in FIG. 4) that are included in, attached to, or otherwise in communication with the computing device 400, and that output data based on the received input data to the processors 402. Alternatively or additionally, in some embodiments each or any of the user input adapters 410 is or includes, for example, a PS/2 interface, a USB interface, a touchscreen controller, or the like; and/or the user input adapters 410 facilitate input from user input devices (not shown in FIG. 4) such as, for example, a keyboard, mouse, trackpad, touchscreen, etc.

In some embodiments, the display device 412 may be a Liquid Crystal Display (LCD) display, Light Emitting Diode (LED) display, or other type of display device. In embodiments where the display device 412 is a component of the computing device 400 (e.g., the computing device and the display device are included in a unified housing), the display device 412 may be a touchscreen display or non-touchscreen display. In embodiments where the display device 412 is connected to the computing device 400 (e.g., is external to the computing device 400 and communicates with the computing device 400 via a wire and/or via wireless communication technology), the display device 412 is, for example, an external monitor, projector, television, display screen, etc.

In various embodiments, the computing device 400 includes one, two, three, four, or more of each or any of the above-mentioned elements (e.g., the processors 402, memory devices 404, network interface devices 406, display interfaces 408, and user input adapters 410). Alternatively or additionally, in some embodiments, the computing device 400 includes one or more of: a processing system that includes the processors 402; a memory or storage system that includes the memory devices 404; and a network interface system that includes the network interface devices 406.

The computing device 400 may be arranged, in various embodiments, in many different ways. As just one example, the computing device 400 may be arranged such that it includes: a multi (or single)-core processor; a first network interface device (which implements, for example, WiFi, Bluetooth, NFC, etc.); a second network interface device that implements one or more cellular communication technologies (e.g., 3G, 4G LTE, CDMA, etc.); and memory or storage devices (e.g., RAM, flash memory, or a hard disk). The processor, the first network interface device, the second network interface device, and the memory devices may be integrated as part of the same SOC (e.g., one integrated circuit chip). As another example, the computing device 400 may be arranged such that: the processors 402 include two, three, four, five, or more multi-core processors; the network interface devices 406 include a first network interface device that implements Ethernet and a second network interface device that implements WiFi and/or Bluetooth; and the memory devices 404 include a RAM and a flash memory or hard disk.

As previously noted, whenever it is described in this document that a software module or software process performs any action, the action is in actuality performed by underlying hardware elements according to the instructions that comprise the software module. Consistent with the foregoing, in various embodiments, each or any combination of the controller node/instances 102, worker instances 116, controller process 108, database 122, cloud computer system 100, hardware computer nodes 103 n, and work queues 114, each of which will be referred to individually for clarity as a “component” for the remainder of this paragraph, are implemented using an example of the computing device 400 of FIG. 4 (or a plurality of such devices). In such embodiments, the following applies for each component: (a) the elements of the computing device 400 shown in FIG. 4 (i.e., the one or more processors 402, one or more memory devices 404, one or more network interface devices 406, one or more display interfaces 408, and one or more user input adapters 410, or appropriate combinations or subsets of the foregoing) are configured to, adapted to, and/or programmed to implement each or any combination of the actions, activities, or features described herein as performed by the component and/or by any software modules described herein as included within the component; (b) alternatively or additionally, to the extent it is described herein that one or more software modules exist within the component, in some embodiments, such software modules (as well as any data described herein as handled and/or used by the software modules) are stored in the memory devices 404 (e.g., in various embodiments, in a volatile memory device such as a RAM or an instruction register and/or in a non-volatile memory device such as a flash memory or hard disk) and all actions described herein as performed by the software modules are performed by the processors 402 in conjunction with, as appropriate, the other elements in and/or connected to the computing device 400 (i.e., the network interface devices 406, display interfaces 408, user input adapters 410, and/or display device 412); (c) alternatively or additionally, to the extent it is described herein that the component processes and/or otherwise handles data, in some embodiments, such data is stored in the memory devices 404 (e.g., in some embodiments, in a volatile memory device such as a RAM and/or in a non-volatile memory device such as a flash memory or hard disk) and/or is processed/handled by the processors 402 in conjunction with, as appropriate, the other elements in and/or connected to the computing device 400 (i.e., the network interface devices 406, display interfaces 408, user input adapters 410, and/or display device 412); (d) alternatively or additionally, in some embodiments, the memory devices 404 store instructions that, when executed by the processors 402, cause the processors 402 to perform, in conjunction with, as appropriate, the other elements in and/or connected to the computing device 400 (i.e., the memory devices 404, network interface devices 406, display interfaces 408, user input adapters 410, and/or display device 412), each or any combination of actions described herein as performed by the component and/or by any software modules described herein as included within the component.

Consistent with the preceding paragraph, as one example, in an embodiment where an instance of the computing device 400 is used to implement controller node 102, the memory devices 404 could load the files associated with the controller process, and/or store the data described herein as processed and/or otherwise handed off to work queues 114. Processors 402 could be used to operate the controller process 108.

The hardware configurations shown in FIG. 4 and described above are provided as examples, and the subject matter described herein may be utilized in conjunction with a variety of different hardware architectures and elements. For example: in many of the Figures in this document, individual functional/action blocks are shown; in various embodiments, the functions of those blocks may be implemented (a) using individual hardware circuits, (b) using an application specific integrated circuit (ASIC) specifically configured to perform the described functions/actions, (c) using one or more digital signal processors (DSPs) specifically configured to perform the described functions/actions, (d) using the hardware configuration described above with reference to FIG. 4, (e) via other hardware arrangements, architectures, and configurations, and/or via combinations of the technology described in (a) through (e).

Technical Advantages of Described Subject Matter

When working in a cloud-based environment, it can be technically advantageous if the techniques allow a given task or process to dynamically scale up and down depending on the workload that is needed at a given point in time (e.g., the workload may vary dramatically from day to day). Such techniques should also advantageously be able to reprocess data in cases where one of the worker nodes fails at some point in the processing. Specifically, if one node of the cloud system fails, the outstanding (or incomplete) work for that node should be redistributed to other nodes.
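The redistribution of a failed node's outstanding work can be illustrated with a minimal Python sketch. It assumes the heartbeat approach recited in the claims: each worker records a timestamp, and a worker whose last heartbeat is older than a threshold is treated as failed. The names `redistribute`, `assignments`, and `heartbeats` are hypothetical and chosen only for illustration; plain dictionaries stand in for the database and the message queues.

```python
def redistribute(assignments, heartbeats, now, threshold_seconds):
    """Return a new job assignment with stale workers' jobs moved to live workers.

    assignments: dict of worker id -> list of outstanding jobs (hypothetical stand-in
                 for the jobs published to each worker's message queue).
    heartbeats:  dict of worker id -> timestamp of the last reported heartbeat.
    A worker that never reported a heartbeat is treated as stale.
    """
    stale = {
        worker
        for worker in assignments
        if now - heartbeats.get(worker, float("-inf")) > threshold_seconds
    }
    live = [worker for worker in assignments if worker not in stale]
    if not live:
        raise RuntimeError("no live workers available to take over jobs")

    # Live workers keep their own jobs.
    result = {worker: list(assignments[worker]) for worker in live}

    # Jobs of stale workers are re-published round-robin to the live workers.
    orphaned = [job for worker in assignments if worker in stale
                for job in assignments[worker]]
    for index, job in enumerate(orphaned):
        result[live[index % len(live)]].append(job)
    return result
```

In this sketch the threshold and the round-robin choice of replacement worker are illustrative assumptions; an actual embodiment could select the replacement queue in other ways.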

In certain example embodiments, the subject matter described herein provides for a dynamic and flexible technique of handling variable computer-based processing requirements in a cloud computer system. In a typical scenario, a configuration file or the like keeps a list of possible worker instances (and their respective message queues). However, if the nature of the processing to be carried out varies from day to day, then that same configuration file will need to be updated (usually manually) to account for the changing availability of instances that can be used for processing. In certain example embodiments, once a worker instance is started by a controller instance, the worker instance self-registers with a database. The controller instance may then see that all of the worker instances that it spawned are ready (e.g., by querying the database). The controller instance can then retrieve the references to the message queues for the individual worker instances and begin publishing jobs to the queues. This allows the controller to not only create an arbitrary number of worker instances, but to also automatically submit jobs to those instances without having to manually set a reference to the queues for each of the worker instances.
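The self-registration flow described above can be sketched in Python as follows. This is a simplified, single-process illustration and not the implementation of any embodiment: the `Registry` class stands in for the database, the `BROKER` dictionary stands in for a message queue service, and the names `Worker`, `run_controller`, and the `work-queue-` prefix are hypothetical. Round-robin publication is used here because the claims mention it as one option.

```python
import queue
import uuid
from dataclasses import dataclass, field

# Hypothetical stand-in for a message queue service: queue reference -> queue.
BROKER = {}


@dataclass
class Registry:
    """Hypothetical stand-in for the database of registered worker queues."""
    rows: dict = field(default_factory=dict)  # instance id -> queue reference

    def register(self, instance_id, queue_ref):
        self.rows[instance_id] = queue_ref

    def queue_refs(self):
        return list(self.rows.values())


class Worker:
    def __init__(self, registry):
        # Initialization: create a unique queue, then self-register its reference.
        self.instance_id = str(uuid.uuid4())
        self.queue_ref = f"work-queue-{self.instance_id}"
        BROKER[self.queue_ref] = queue.Queue()
        registry.register(self.instance_id, self.queue_ref)

    def drain(self):
        """Read every job currently in this worker's queue and return them."""
        jobs = []
        q = BROKER[self.queue_ref]
        while not q.empty():
            jobs.append(q.get())
        return jobs


def run_controller(jobs, worker_count):
    registry = Registry()
    workers = [Worker(registry) for _ in range(worker_count)]

    # The controller queries the registry to confirm every spawned worker is ready.
    refs = registry.queue_refs()
    if len(refs) != worker_count:
        raise RuntimeError("not all workers have registered")

    # Publish jobs round-robin using only the references read from the registry.
    for index, job in enumerate(jobs):
        BROKER[refs[index % len(refs)]].put(job)

    return [worker.drain() for worker in workers]
```

Note that the controller never hard-codes a queue name: it learns every queue reference from the registry, which is the point of the self-registration technique. In a real deployment the workers would run on separate instances and the controller would poll the database until all references appear.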

The technical features described herein improve the reliability and flexibility in handling large scale and variable processing problems in a cloud or distributed computing context.

Selected Terminology

Whenever it is described in this document that a given item is present in “some embodiments,” “various embodiments,” “certain embodiments,” “certain example embodiments,” “some example embodiments,” “an exemplary embodiment,” or whenever any other similar language is used, it should be understood that the given item is present in at least one embodiment, though is not necessarily present in all embodiments. Consistent with the foregoing, whenever it is described in this document that an action “may,” “can,” or “could” be performed, that a feature, element, or component “may,” “can,” or “could” be included in or is applicable to a given context, that a given item “may,” “can,” or “could” possess a given attribute, or whenever any similar phrase involving the term “may,” “can,” or “could” is used, it should be understood that the given action, feature, element, component, attribute, etc. is present in at least one embodiment, though is not necessarily present in all embodiments. Terms and phrases used in this document, and variations thereof, unless otherwise expressly stated, should be construed as open-ended rather than limiting. As examples of the foregoing: “and/or” includes any and all combinations of one or more of the associated listed items (e.g., a and/or b means a, b, or a and b); the singular forms “a”, “an” and “the” should be read as meaning “at least one,” “one or more,” or the like; the term “example” is used to provide examples of the subject under discussion, not an exhaustive or limiting list thereof; the terms “comprise” and “include” (and other conjugations and other variations thereof) specify the presence of the associated listed items but do not preclude the presence or addition of one or more other items; and if an item is described as “optional,” such description should not be understood to indicate that other items are also not optional.

As used herein, the term “non-transitory computer-readable storage medium” includes a register, a cache memory, a ROM, a semiconductor memory device (such as a D-RAM, S-RAM, or other RAM), a flash memory, a magnetic medium such as a hard disk, a magneto-optical medium, an optical medium such as a CD-ROM, a DVD, or Blu-Ray Disc, or other type of device for non-transitory electronic data storage. The term “non-transitory computer-readable storage medium” does not include a transitory, propagating electromagnetic signal.

Additional Applications of Described Subject Matter

Although process steps, algorithms or the like, including without limitation with reference to FIGS. 1-3, may be described or claimed in a particular sequential order, such processes may be configured to work in different orders. In other words, any sequence or order of steps that may be explicitly described or claimed in this document does not necessarily indicate a requirement that the steps be performed in that order; rather, the steps of processes described herein may be performed in any order possible. Further, some steps may be performed simultaneously (or in parallel) despite being described or implied as occurring non-simultaneously (e.g., because one step is described after the other step). Moreover, the illustration of a process by its depiction in a drawing does not imply that the illustrated process is exclusive of other variations and modifications thereto, does not imply that the illustrated process or any of its steps are necessary, and does not imply that the illustrated process is preferred.

Although various embodiments have been shown and described in detail, the claims are not limited to any particular embodiment or example. None of the above description should be read as implying that any particular element, step, range, or function is essential. All structural and functional equivalents to the elements of the above-described embodiments that are known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed. Moreover, it is not necessary for a device or method to address each and every problem sought to be solved by the present invention, for it to be encompassed by the invention. No embodiment, feature, element, component, or step in this document is intended to be dedicated to the public.

1. A cloud computer system comprising: a plurality of computer devices coupled together via an electronic data communications network, each of the plurality of computer devices having at least one hardware processor and a storage system, where at least one of the plurality of computer devices is configured as a controller node; a database stored on electronic data storage; the controller node being programmed to: accept a request to initiate a distributed process that includes a plurality of data jobs, determine a number of worker instances to create across the plurality of computer devices, and cause, for the determined number of worker instances, a plurality of worker instances to be created on the plurality of computer devices; each one of the plurality of worker instances being programmed to: create a unique message queue for the corresponding worker instance, and submit, to the database for storage therein, a reference for the unique message queue that was created for the corresponding worker instance; the controller node is further programmed to: retrieve each one of the references to the unique message queues for the plurality of created worker instances, and use the references to the unique message queues to publish the plurality of data jobs to corresponding ones of the unique message queues; and wherein each one of the plurality of worker instances is further programmed to read at least one data job contained in a corresponding unique message queue and process the read at least one data job.
2. The cloud computer system of claim 1, wherein each one of the plurality of worker instances is further programmed to: during processing of a corresponding data job, report a heartbeat signal to the database that indicates that the corresponding worker instance is working.
3. The cloud computer system of claim 2, wherein the controller node is further programmed to: determine, based on stored heartbeat signals of the plurality of worker instances, that a last update for the heartbeat signal for a first worker instance is longer than a threshold time; and in response to determination that the first worker instance has not updated its heartbeat signal, publish the data job(s) that were published to the unique message queue of the first worker instance to another unique message queue that is associated with another worker instance.
4. The cloud computer system of claim 1, wherein the creation of the unique message queue and the submission of the reference are both executed during an initialization function for each corresponding worker instance.
5. The cloud computer system of claim 1, wherein the controller node is further programmed to: poll the database to determine when the reference for the unique message queue of each corresponding worker instance has been stored to the database.
6. The cloud computer system of claim 1, wherein the data jobs are published to the unique message queues using a round-robin process.
7. The cloud computer system of claim 1, wherein each one of the plurality of worker instances is further programmed to: store, to the database and in association with the reference for the unique message queue, an instance identifier that uniquely identifies the created worker instance.
8. The cloud computer system of claim 1, wherein each one of the plurality of worker instances is shut down or destroyed upon completion of the distributed process.
9. The cloud computer system of claim 8, wherein the distributed process is a process to generate at least one report based on an input data set.
10. The cloud computer system of claim 1, wherein the database includes a first table and each newly created worker instance is further programmed to submit a request that generates a new record for the first table with at least the following columns: 1) an instance identifier for the worker instance, 2) the reference for the unique message queue, 3) a job message that corresponds to a currently executing job for the worker instance, 4) a timestamp for a start time of the job, and 5) a timestamp for the completion of the job.
11. A method of operating a cloud computer system that includes a plurality of computer devices coupled together via an electronic data communications network, each of the plurality of computer devices having at least one hardware processor and a storage system, where at least one of the plurality of computer devices is configured as a controller node, the method comprising: on the controller node, accepting a request to initiate a distributed process that includes a plurality of data jobs; on the controller node, determining a number of worker instances to create across the plurality of computer devices; on the controller node, requesting, for the determined number of worker instances, a plurality of worker instances to be created on the plurality of computer devices; on each one of the plurality of worker instances that are executing in response to the request from the controller node: (a) generating a unique message queue for the corresponding worker instance, and (b) submitting, to a database of the cloud computer system for storage therein, a reference for the unique message queue that was created for the corresponding worker instance; on the controller node, retrieving, from the database, each one of the references to the unique message queues for the plurality of created worker instances; on the controller node, using the references to the unique message queues to publish the plurality of data jobs to corresponding ones of the unique message queues; and on each one of the plurality of worker instances, reading at least one data job contained in a corresponding unique message queue and subsequently processing the at least one data job.
12. The method of claim 11, further comprising: on each worker instance and at least during processing of a corresponding data job, reporting a heartbeat signal to the database that indicates that the corresponding worker instance is working.
13. The method of claim 12, further comprising: determining, based on stored heartbeat signals of the plurality of worker instances, that a last update for the heartbeat signal for a first worker instance is longer than a threshold time; and in response to determination that the first worker instance has not updated its heartbeat signal, publishing the data job(s) that were published to the unique message queue of the first worker instance to another unique message queue that is associated with another worker instance.
14. The method of claim 11, wherein the creation of the unique message queue and the submission of the reference are both executed during an initialization process for each corresponding worker instance.
15. The method of claim 11, further comprising: polling the database to determine when the reference for the unique message queue of each corresponding worker instance has been stored to the database.
16. The method of claim 11, wherein the data jobs are published to the unique message queues using a round-robin process.
17. The method of claim 11, further comprising: on each of the worker instances, storing, to the database, an instance identifier that uniquely identifies the created worker instance with the reference for the unique message queue.
18. A non-transitory storage medium storing instructions for use with a cloud computer system that includes a plurality of computer devices coupled together via an electronic data communications network, each of the plurality of computer devices having at least one hardware processor and a storage system, where at least one of the plurality of computer devices is configured as a controller node, the stored instructions comprising instructions configured to: accept a request to initiate a distributed process that includes a plurality of data jobs; determine a number of worker instances to create across the plurality of computer devices; cause, for the determined number of worker instances, a plurality of worker instances to be created on the plurality of computer devices; as part of each worker instance, create a unique message queue for the corresponding worker instance; as part of each worker instance, submit, to the database for storage therein, a reference for the unique message queue that was created for the corresponding worker instance; as part of the controller instance, retrieve each one of the references to the unique message queues for the plurality of created worker instances; as part of the controller instance, use the references to the unique message queues to publish the plurality of data jobs to corresponding ones of the unique message queues; and as part of each worker instance, read at least one data job contained in a corresponding unique message queue and process the read at least one data job.
19. The non-transitory storage medium of claim 18, wherein the stored instructions comprise further instructions that are configured to: during processing of a corresponding data job, report a heartbeat signal to the database that indicates that the corresponding worker instance is working; determine, based on stored heartbeat signals of the plurality of worker instances, that a last update for the heartbeat signal for a first worker instance is longer than a threshold time; and in response to determination that the first worker instance has not updated its heartbeat signal, publish the data job(s) that were published to the unique message queue of the first worker instance to another unique message queue that is associated with another worker instance.
20. The non-transitory storage medium of claim 19, wherein the creation of the unique message queue and the submission of the reference are both executed during an initialization function for each corresponding worker instance.